prov/efa: Add support for blocking CQ read operations for RDM #11568
base: main
Conversation
CCing @j-xiong for the …
j-xiong
left a comment
The fi_tostr changes look good to me.
Can you expand https://github.com/ofiwg/libfabric/blob/main/fabtests/pytest/efa/test_rdm.py#L165 and https://github.com/ofiwg/libfabric/blob/main/fabtests/pytest/efa/test_rma_bw.py#L90 to test sread/fd with the efa fabric?
May I know if this PR follows some existing code examples in other providers? The logic looks reasonable to me, but I'm not very familiar with the ofi_wait functions yet.
```c
		return ret;

	util_cq = container_of(fids[i], struct util_cq, cq_fid.fid);

	/* RDM CQs use util wait objects, not hardware CQ events */
```
What are the benefits of using util wait instead of an EFA device interrupt to implement wait? Is it because there can be too many device-level completions for only one libfabric completion?
The major requirement of the blocking CQ read is to avoid burning the CPU in the busy loop that applications fall back to when sread is not available. Are we confident this approach meets that requirement? cc @bwbarrett
My understanding is we'd need SHM to expose some kind of wait object (e.g. FI_WAIT_FD) in order to implement any kind of multiplexing with EFA device interrupts. I'd be more than happy to engineer a solution but I figured that would be out of scope given the context/urgency of the feature. Could we consider this a follow-up/future improvement?
Short of that, don't we need to periodically drive progress or risk missing completions? I figured the exponential backoff strategy with a minimum interval of 1ms was appropriate since it matches backoff strategies we employ elsewhere. The ofi_wait() calls block on the util CQ FD between retries, so it's not "spinning," per se.
I referenced util, CXI, and a few others. I also originally added some defensive logic for a wait object for the SHM CQ FD here, but I opted to exclude it from the PR since it's purely future-proofing. SHM doesn't support …
```c
	/* RDM CQs use util wait objects, not hardware CQ events */
	if (util_cq->wait) {
		rdm_cq = container_of(util_cq, struct efa_rdm_cq, efa_cq.util_cq);
		ret = efa_rdm_cq_trywait(rdm_cq);
```
How will this work on a single node when shm is on? AFAICT it only calls efa_rdm_cq_progress, which doesn't progress the shm CQ.
efa_rdm_cq_trywait() should now be progressing the SHM CQ via fi_cq_read()
Now this seems to contradict your earlier change to fabtests that adds progress in the waitfd code path. If we think calling a progress function outside fi_trywait is required for a manual_progress provider, do we still need a progress function inside fi_trywait?
fi_trywait won't process SHM completions and move them into the util CQ. My comment in efa_rdm_cq_trywait is probably misleading.
> fi_trywait won't process SHM completions and move them into the util CQ.

Are you saying fi_trywait for the shm CQ (no efa) will not progress shm completions, so you need that logic in fabtests? With your PR, fi_trywait for the efa CQ (+shm as peer) will progress both, right?
prov/efa/src/rdm/efa_rdm_cq.c
Outdated
```c
	}

	/* Progress then wait */
	efa_rdm_cq_progress(util_cq);
```
This only progresses efa, not shm.
This was a miss. I'll add it here.
If you change it to something like fi_cq_read(efa, NULL, 0) it should progress both efa and shm
Ack, replaced
Is it updated yet?
Sorry, I thought I was commenting on a different line. The efa_rdm_cq_readfrom() call at the top of the loop is already progressing and flushing the SHM completions into the util CQ. Then we progress the EFA CQ to generate completions.
Yes, but you may want to be consistent; efa_rdm_cq_progress doesn't really progress the whole CQ.
Bump this... efa_rdm_cq_progress only progresses efa endpoints. If we want to progress the whole state machine here, it should be fi_cq_read(NULL, 0); if we don't need progress here (because there is a cq_readfrom earlier already), we can just remove it.
Force-pushed 23446c1 to 5ac11a6
Added 65c9de6 after the previous changes exposed a potential use-after-free upon SHM CQ init failure. More of a "nice-to-have", since the Libfabric spec doesn't make guarantees about the value of the CQ FID upon cq_open failure.
prov/efa/src/rdm/efa_rdm_cq.c
Outdated
```c
	domain = container_of(cq->efa_cq.util_cq.domain, struct efa_domain, util_domain);

	if (cq->shm_cq) {
```
Why not just do fi_cq_read(NULL, 0) for efa? It should progress both efa and shm.
Ack, replaced.
Currently, any fabtests that explicitly test …

If we do not need shm support for the efa rdm blocking CQ read right now, would it be easier to turn off shm when a wait object is requested in cq open?

Sure, let's try that.
Force-pushed 5ac11a6 to ab5a0d1
As I suspected, I don't think the SHM CQ is the issue here; disabling SHM has no effect. The actual issue is that, AFAICT, … The latest commits I've pushed disable fabtests in the EFA pytest suite with …
Looks like my changes to skip fabtests with …

ETA: that'd be because I didn't add the logic for the RMA BW tests. I thought I implemented the skip for those but not RDM pingpong. Doesn't matter.
Force-pushed ab5a0d1 to 6bd3e96
CC @j-xiong and/or @aingerson for the fabtests change: …
Force-pushed b1a045d to bb5b409
fabtests/common/shared.c
Outdated
```c
	while (total - *cur > 0) {
		ret = fi_cq_sread(cq, &comp, 1, NULL, timeout);
		if (ft_fdwait_need_progress)
			ret = fi_cq_read(cq, &comp, 1);
```
I am not sure why we need this... cq_sread should progress completions for a manual_progress provider anyway. Also, sread is designed for the FI_WAIT_UNSPEC wait object; we shouldn't have an fdwait leak here even if the provider may use wait_fd to implement sread.
This is solely a workaround for -U when the provider advertises manual progress. But now I'm thinking -U may not be valid for manual progress.
Not sure I follow; -U runs fabtests with FI_DELIVERY_COMPLETE set. Is that relevant?
Delivery-complete sends only complete after the peer consumes them. If both EPs post sends and immediately block on their TX CQ via fi_cq_sread(), no thread ever calls into the RX CQ, so neither side makes progress. The periodic progress forces each side to re-enter libfabric, drain the RX CQ, and let the delivery-complete TX CQE appear. Under transmit-complete, the TX CQE is generated locally, so sread eventually returns without any extra help.
> If both EPs post sends and immediately block on their TX CQ via fi_cq_sread(), no thread ever calls into the RX CQ, so neither side makes progress. The periodic progress forces each side to re-enter libfabric, drain the RX CQ, and let the delivery-complete TX CQE appear. Under transmit-complete, the TX CQE is generated locally, so sread eventually returns without any extra help.

How to implement DC is provider-specific; the application doesn't have to poll its RX CQ for the peer's TX CQ to get the delivery-completion ack. The EFA provider fills this gap by making the libfabric TX CQ poll also poll the RX device CQ internally. So the behavior you described should never happen?
But you raised a good question. Today, the receiver needs to poll its CQ so the sender can get completions for DC. I haven't checked in the man page whether that is legal behavior for manual progress.
The deadlock is technically compliant from the provider's perspective.

I see what happened here. I removed the unconditional exponential backoff when the wait object is FI_WAIT_FD, after your initial comments re: spinning the CPU. For both the fd and sread completion methods, the wait object is set to FI_WAIT_FD, so given an infinite timeout, it will hang forever after the initial manual progress. Perhaps for an infinite timeout, we fall back to the exponential backoff?

On second thought, I think the deadlock may ultimately be more "correct" here. As you mentioned earlier, the app probably has a good reason to use the blocking CQ read for resource efficiency, so this seems like a necessary tradeoff to that end.
man/fi_efa.7.md
Outdated
```
of RDM endpoint.

The `efa` fabric of RDM endpoint supports *FI_WAIT_FD* and *FI_WAIT_NONE* wait
objects for blocking CQ operations (*fi_cq_sread*). Applications should use
```
Also FI_WAIT_UNSPEC?
Added
prov/efa/src/rdm/efa_rdm_cq.c
Outdated
```c
	efa_rdm_cq_progress(util_cq);

	/* Check if progress produced completions before blocking */
	if (OFI_LIKELY(!ofi_cirque_isempty(util_cq->cirq)))
```
Ugh, so you have an is-empty check here... then a progress call is required before this?
Removed this and the preceding progress call
Signed-off-by: Darryl Abbate <[email protected]>
This detects when manual progress is required (FI_PROGRESS_MANUAL with FT_COMP_WAIT_FD, FT_COMP_SREAD, or FT_COMP_YIELD) and periodically calls fi_cq_read() to drive progress while waiting for completions. Signed-off-by: Darryl Abbate <[email protected]>
This prevents a dangling CQ FID pointer upon CQ open failure. While it's not strictly against the Libfabric spec, it's probably best not to leave the CQ FID set after a failure. Signed-off-by: Darryl Abbate <[email protected]>
This adds support for `fi_cq_sread{,from}`, `fi_control` and `fi_signal`
for the `efa` fabric. This also expands EFA's `fi_trywait()` to handle
wait objects for Util CQs.
Since the endpoint needs to progress in order to generate completions to
read from, an exponential backoff strategy is employed to poll by
driving progress periodically without excessively spinning the CPU for
higher latency scenarios. The `timeout` specified by the user is used to
cap the total time spent waiting for completions.
A CQ requested with FI_WAIT_FD indicates the app intends to drive
progress itself as needed, since EFA advertises FI_PROGRESS_MANUAL. In
this scenario, sreadfrom will block without periodically driving
progress.
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
Signed-off-by: Darryl Abbate <[email protected]>
Force-pushed bb5b409 to 8b8e568
shijin-aws
left a comment
```c
	if (remaining < 0)
		return FT_FDWAIT_PROGRESS_INTERVAL_MS;

	interval = MIN(remaining, FT_FDWAIT_PROGRESS_INTERVAL_MS);
	return interval > 0 ? interval : FT_FDWAIT_PROGRESS_INTERVAL_MS;
```
Can be simplified as:

```c
if (remaining <= 0)
	return FT_FDWAIT_PROGRESS_INTERVAL_MS;
return MIN(remaining, FT_FDWAIT_PROGRESS_INTERVAL_MS);
```

And the `interval` variable can be removed.
```c
	int wait_timeout = ft_fdwait_need_progress
		? ft_fdwait_poll_timeout(remaining)
		: remaining;
```
The same condition is checked inside ft_fdwait_poll_timeout(); you can eliminate this variable and use ft_fdwait_poll_timeout(remaining) directly.
```c
		return 0;
	return ret;
```
There is no error path reaching here; you can just return 0.
```c
	} else if (ret < 0 && ret != -FI_EAGAIN) {
		return ret;
	}
	ret = 0;
```
This becomes unnecessary as well.